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ABSTRACT 


For J > 2 independent groups, the paper deals with testing the global hypothesis that 
all J groups have a common population median or identical quantiles, with an emphasis on 
the quartiles. Classic rank-based methods are sometimes suggested for comparing medians, 
but it is well known that under general conditions they do not adequately address this goal. 
Extant methods based on the usual sample median are unsatisfactory when there are tied 
values except for the special case J = 2. A variation of the percentile bootstrap used in 
conjunction with the Harrell-Davis quantile estimator performs well in simulations. The 
method is illustrated with data from the Well Elderly 2 study. 

Keywords: Tied values; bootstrap methods; Harrell-Davis estimator; projection dis¬ 
tances; Well Elderly 2 study 


1 Introduction 

For J independent random variables, let 6j be the population median or some other quantile 
associated with jth variable (j = 1,..., J). The paper considers the problem of testing 

Ho-.ei = --- = ej, ( 1 ) 

particularly when there are tied values. The focus is on comparing the medians as well as 
the upper or lower quartiles, but it is evident that the results are relevant when comparing 
other quantiles instead. 

For the special case J = 2, the Wilcoxon-Mann-Whitney (WMW) test is sometimes 
suggested for comparing medians, but it is well known that under general conditions it does 
not accomplish this goal (e.g., Hettmansperger, 1984; Fung, 1980). Roughly, the reason is 
that for two independent random variables, X and Y, it is not based on an estimate of — 
but rather on an estimate of P{X < Y). Another concern is that when distributions differ, 
under general conditions the WMW test uses the wrong standard error (e.g.. Cliff, 1996; 
Wilcox, 2012). More generally, the Kruskall-Wallis test, which reduces to the Wilcoxon- 
Mann-Whitney test when J = 2, does not test (1). 
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Yet another possible approach is to use the usual sample median in conjunction with 
a permutation test. However, results in Romano (1990) establish that this approach is 
unsatisfactory as well. 

For a random sample Xi ,let X(i) < ... < X(„) denote the observations written 
in ascending order and let Mj denote the usual sample median. That is, if the number of 
observations, n, is odd, 

M = Y(m), 

where m = (n + l)/2 and if n is even, 

_ i^jm) + "^(m+1)) 

2 

where now m = n/2. A natural way of proceeding is to estimate 6j with Mj and use some 
test statistic that is based in part on some estimate of the standard error of Mj. When 
sampling from a continuous distribution where tied values occur with probability zero, an 
effective method was studied by Bonett and Price (2002) that can be used when J = 2 
or when J > 2 and the goal is to test some hypothesis based on a linear contrast of the 
population medians. Numerous methods for estimating the standard error of Mj have been 
derived, but extant results indicate that all of them can perform poorly when tied values 
can occur (Wilcox, 2012). Wilcox (2006) found a slight extension of a standard percentile 
bootstrap method that performs well when testing (1), there are tied values and J = 2. But 
Wilcox (2012) notes that in terms of testing (1) when J > 2, evidently no method has been 
found that performs well in simulations when there are tied values. 

There is another complication when working with the usual sample median. It is well 
known that when sampling from a continuous distribution, under certain regularity condi¬ 
tions, Mj is asymptotically normal. However, when sampling from a discrete distribution 
with a hnite sample space, Mj does not converge to a normal distribution. More broadly, 
when estimating quantiles using a single order statistic, or a weighted average of two or¬ 
der statistics, assuming asymptotic normality is generally unsatisfactory when dealing with 
discrete distributions where tied values occur. 


As an illustration, consider a beta-binomial distribution having the probability function 


P{x) 


B(m — X + r.iX + s) 

{m + l)B(m — a; -T 1, a; -I- l)B(r, s) ’ 


( 2 ) 
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where B is the complete beta function, r > 0 and s > 0 are parameters that determine the 
shape of the distribution and x = 0,... ,m. Consider m = 30. Then the cardinality of the 
sample space is 31 and as is evident, if the sample size is n > 31, tied values occur with 
probability one. The left panel of Figure 1 shows a plot of 3000 sample medians generated 
from a beta-binomial distribntion with r = 1 and s = 3 based on a sample size n = 100. 
(So the beta-binomial distribntion is skewed to the right, P{x) is monotonic decreasing 
and X = 6 corresponds to the .52 qnantile.) The right panel is the same as the left, only 
now n = 500. As is evident, the sampling distribntion has not moved closer to a normal 
distribntion and indeed the cardinality of the sample space has decreased, indicating that 
any method for making inferences based on the sample median that assnmes asymptotic 
normality can perform poorly. 

For the special case where the goal is to compare two independent gronps, a method for 
comparing qnantiles that deals effectively with tied values is to use a percentile bootstrap in 
conjnnction with the qnantile estimator derived by Harrell and Davis (1982); see Wilcox et 
al. (2013). The Harrell-Davis estimator uses a weighted average of all the order statistics. 
The result is a sampling distribntion that is typically well approximated by a continuous 
distribution. Consider again the right panel of Fignre 1 and note that the cardinality of the 
sample space is only hve. That is, only hve valnes for the sample median were observed 
among the 3000 estimates. In contrast, if the Harrell-Davis estimator is used, there are no 
tied valnes among all 3000 estimates. But the method studied by Wilcox et al. is limited 
to J = 2; there are no resnlts on how best to proceed when testing (1) and there are J > 2 
independent gronps. 

Here, two methods for testing (1) were considered, both of which were based on the 
Harrell-Davis estimator. The hrst was based on a test statistic mentioned by Schrader and 
Hettmansperger (1980), and stndied by He, Simpson and Portnoy (1990). The basic strategy 
was to nse a percentile bootstrap method to estimate the nnll distribntion. Bnt sitnations 
were fonnd where this approach performed very poorly when dealing with tied valnes, so 
fnrther details are omitted. The other method is described in section 2 and simnlation 
results are reported in section 3. Section 4 illustrates the proposed method using data from 
the Well Elderly 2 stndy. The strategy is not new and has been fonnd to perform reasonably 
well when dealing with other robust measures of location (Wilcox, 2012). However, when 
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using the usual sample median for the situation at hand, preliminary simulations found that 
it performs poorly in terms of controlling the Type I error probability. Results in Wilcox et 
ah (2013) suggest that using instead the Harrell-Davis estimator, reasonably good control 
over the Type I error probability might be obtained. So the goal here is to determine the 
extent to which this is the case. 

Note that in terms of characterizing the typical value of some random variable, the pop¬ 
ulation median is an obvious choice. However, situations are encountered where differences 
occur in the tails of a distribution that have substantive interest (e.g. Doksum & Sievers, 
1976; Wilcox et ah, 2014). This issue can be addressed by comparing quantiles other than 
the median, which can help provide a deeper understanding of how distributions differ. This 
point is illustrated in section 4. 


2 Description of the Proposed Method 


To describe the Harrell and Davis (1982) estimate of the gth quantile, let H be a random 
variable having a beta distribution with parameters a = (n -|- l)q and b = (n -|- 1)(1 — g). 
That is, the probability density function of Y is 


r(a + b) 

fWW) 




b-l 


, 0 <|/< 1 , 


where T is the gamma function. Let 

W, = P 



<Y < 



For a random sample Xi,... ,X„, let X(i) < ... < denote the observations written in 
ascending order. The Harrell-Davis estimate of 9q, the gth quantile, is 


^=1 


(3) 


In terms of its standard error, Sfakianakis and Verginis (2006) show that in some situa¬ 
tions the Harrell-Davis estimator competes well with alternative estimators that again use 
a weighted average of all the order statistics, but there are exceptions. (Sfakianakis and 
Verginis derived alternative estimators that have advantages over the Harrell-Davis in some 
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situations. But when sampling from heavy-tailed distributions, the standard errors of their 
estimators can be substantially larger than the standard error of 6 g.) Additional comparisons 
of various estimators are reported by Parrish (1990), Sheather and Marron (1990), as well 
as Dielman, Lowry and Pfaffenberger (1994). The only certainty is that no single estimator 
dominates in terms of efficiency. For example, the Harrell-Davis estimator has a smaller 
standard error than the usual sample median when sampling from a normal distribution or 
a distribution that has relatively light tails, but for sufficiently heavy-tailed distributions, 
the reverse is true (Wilcox, 2012, p. 87). 

Let 6 jk = Oj — 6 k- Roughly, the strategy for testing (1) is to test 

Ho ; 6 ,k = 0, Vj < k, (4) 

via a percentile bootstrap method. If the null hypothesis is true, then 0, a vector of zeros 
having length (J^ —J)/2, should be nested fairly deeply within a cloud of bootstrap estimates 
of 5jk- Moreover, the depth of 0 can be used to compute a p-value as will be indicated. 
A natural way of measuring the depth of 0 within a bootstrap cloud is via Mahalanobis 
distance. Note, however, that the 5jk parameters are linearly dependent. This indicates that 
Mahalanobis distance can fail from a computational point of view because the bootstrap 
covariance matrix will be singular. This proved to be the case, so the strategy here is to 
measure the depth of 0 using a method that does not require the use of a covariance matrix. 

For completeness, note that the issue of a singular covariance matrix could be avoided by 
using the first group as a reference group and testing Hq\ 6 i — 62 = di — 6 ^ = ■ ■ ■ = 6 i — 6j. 
In terms of a Type I error, this approach is reasonable, but in terms of power, this is not 
necessarily the case. The reason is that power can depend on which group is used as the 
reference group because the choice of the reference group impacts the magnitude of the 
differences between the medians that are compared. 

To describe the details of the proposed test of (1) via (4), let Xij be a random sample 
from the jth group {i = 1,... ,nj). Generate a bootstrap sample from the jth group by 
resampling with replacement nj observations from group j. Let 6 * be the estimate of the gth 
quantile for group j based on this bootstrap sample. Let S*i^ = 9* — j < k. Repeat this 
process B times yielding b = 1,... ,B. Here, B = 600 is used in order to avoid overly 
high execution time and because this choice has been found to provide reasonably good 
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control over the Type I error probability when dealing with related bootstrap techniques 
(e.g., Wilcox, 2012). However, in terms of power, there might be a practical advantage to 
using a larger choice for B (Racine & MacKinnon, 2007). 

A portion of the strategy used here is based on measuring the depth of a point in a 
multivariate data cloud using a projection-type method, which provides an approximation 
of half-space depth (Wilcox, 2012, section 6.2.5). For notational convenience, momentarily 
focus on an n X p matrix of data, Y. Let f be some measure of location based on Y. For 
simplicity, the marginal medians (based on the usual sample median) are used. Let 

U, = Yi - f 


(i = l,...,n), 

and for any j (j = 1,..., n), let 


and 


Q = U*U' 


k=l 

rp _ / Tj Tj \ 

-‘^ij ^ • • • ) '-'ip) 


^ij ~ 11^0 il) 


(5) 


where ||Tjj|| is the Euclidean norm associated with the vector (i = 1,. .. n; j = 1,. .., n). 
Let 

d - = 

qi 2 - qa 

where qi 2 and qn are estimates of the upper and lower quartiles, respectively, based on 
Dii ,..., Din. Here, qi 2 and qn are estimated with the so-called ideal fourths (e.g., Friqqe et 
ah, 1989.), which are computed as follows. Let j be the integer portion of (n/4) -|- (5/12) 
and let 

n 5 

'*=4 + 12 -^' 

The lower quartile is estimated with s 




( 6 ) 



where -Di(i) < ■ ■ ■ < Di(^n)- Letting k = n — j + 1, the upper quartile is 

Qi2 (1 h)-Dj(fc) -|- (7) 

The projection distance of Yj, relative to the cloud of points represented by Y, is the 
maximum value of dij, the maximum being taken over i = 1,... ,n. This measure of depth 
is nearly the same as the measure derived by Donoho and Gasko (1992). the only difference 
is that they used the median absolute difference (mad) as a measure of scale rather than the 
interquartiles range. MAD has a higher breakdown point but using the interquartile range 
has been found to perform better in various situations (Wilcox, 2012). This might be due 
to the poor efficiency of mad, but the extent this is the case is unclear. Perhaps using mad 
would perform well in the simulations reported here, but this is left to future investigations. 

Now create a {B + 1) x L matrix G where the first B rows are based on the 5^^^, 
b = 1,..., B, L = {.P — J)/2. That is, row b consists of the L values associated with 
for all j < k. Row R + 1 of G is a vector 0 having length L. Then from general theoretical 
results in Liu and Singh (1997), a (generalized) p-value can be computed based on the relative 
distance of 0. Compute the projection distance for each row of G. The distance associated 
with the 6th row is denoted by Ki, and the distance for the null vector (row B + 1) is denoted 
by Kq. Then a generalized p-value is 

^ 6=1 

where the indicator function I{Kq > Kb) = 1 if Kq > Kb, otherwise I{Kq > Kb) = 0. This 
will be called method Q. 

3 Simulation Results 

Simulations were used to study the small-sample properties of method Q when there are 
J = 4 groups. Results are reported when comparing medians as well as the lower and 
upper quartiles. Estimated Type I error probabilities, a, were based on 4000 replications. 
Both continuous and discrete distributions were used. The four continuous distributions 
were normal, symmetric and heavy-tailed, asymmetric and light-tailed, and asymmetric and 
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Table 1: Some properties of the g-and-h distribution. 


g 

h 

Ki 

«2 

0.0 

0.0 

0.00 

3.0 

0.0 

0.2 

0.00 

21.46 

0.2 

0.0 

0.61 

3.68 

0.2 

0.2 

2.81 

155.98 


heavy-tailed. More precisely, four g-and-h distributions were used (Hoaglin, 1985) that 
contain the standard normal distribution as a special case. If Z has a standard normal 
distribution, then 


W = 


expy)-i exp(hZV2), ifc/>0 
Zexp(hZ^/2), if g = 0 

has a g-and-h distribution where g and h are parameters that determine the hrst four mo¬ 
ments. The four distributions used here were the standard normal {g = h = 0.0), a symmetric 
heavy-tailed distribution {h = 0.2, g = 0.0), an asymmetric distribution with relatively light 
tails {h = 0.0, g = 0.2), and an asymmetric distribution with heavy tails {g = h = 0.2). Ta¬ 
ble 1 shows the skewness (ki) and kurtosis (^ 2 ) for each distribution. Additional properties 
of the g-and-h distribution are summarized by Hoaglin (1985). 


As for situations where tied values can occur, consider a discrete distribution with a 
sample space having cardinality N. A goal in the simulations was to get some sense about 
how well method Q controls the Type I error probability when N is small. Roughly, as 
the likelihood of tied values increases, at what point does method Q break down? Here, 
results are reported when data are generated from a beta-binomial distribution for which 
the cardinality of the sample space is N = m + 1 = 11 and N = m + 1 = 21. The choices 
for (r, s) were (3, 3), which has a symmetric distribution, as well as (1, 3) and (1,9), which 
are skewed distributions. 


First consider the four g-and-h distributions when testing at the .05 level and the groups 
have a common sample size n = 20. As indicated in Table 2, the estimated Type I error 
probability ranges between .025 and .062. Although the importance of a Type I error depends 
on the situation, Bradley (1978) suggests that as a general guide, when testing at the .05 
level, the actual level should not drop below .025 or exceed .075. In Table 2, the estimates 
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Table 2: Estimated Type I Error Probability using method Q, continuous distributions, 


q 

9 

h 

n = 20 

n = 50 

0.25 

0.0 

0.0 

0.058 

0.059 

0.25 

0.0 

0.2 

0.031 

0.046 

0.25 

0.2 

0.0 

0.061 

0.058 

0.25 

0.2 

0.2 

0.038 

0.055 

0.50 

0.0 

0.0 

0.059 

0.061 

0.50 

0.0 

0.2 

0.046 

0.057 

0.50 

0.2 

0.0 

0.062 

0.062 

0.50 

0.2 

0.2 

0.054 

0.056 

0.75 

0.2 

0.0 

0.049 

0.054 

0.75 

0.2 

0.2 

0.025 

0.038 


were in this range. 

A possible criticism of the results in Table 2 is that they are based on only 4000 repli¬ 
cations. Consequently, some comments about the precision of the estimates in Table 2 are 
provided. Assuming Bradley’s criterion is reasonable, consider the issue of whether the ac¬ 
tual level is less than or equal .075. Using the method in Pratt (1968), it can be seen that 
based on a two-sided .95 confidence interval for the actual level, the confidence interval will 
not contain .075 if d < .06675. All of the results in Table 2 suggest that the actual level 
does not exceed .075. Using instead a .99 confidence interval for the actual level, a < .06425 
indicates that the actual level does not exceed .075. In a similar manner, based on a two- 
sided .95 conhdence interval, the conhdence interval for the actual level does not contain 
.025 if d > .03025. For a .99 conhdence interval, d > .032 is required and there are only two 
situations where the estimate is less than .032 which occurred for n = 20 when comparing 
the quartiles. 

For normal distributions, a simulation was run with n = 100 as an additional check on 
how the method performs as n gets large. The estimated Type I error probability was .058. 

For the beta-binomial distributions, estimated Type I error probabilities are shown in 
Table 3. For m = 20, control over the Type I error probability is reasonably good when 
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Table 3: Estimated probability of a Type I error using method Q when sampling from a beta- 
binomial distribution for a sample sizes n = 20 and 50, a = .05, and where the cardinality 
of the sample space is iV = m -|- 1. 


q 

r 

s 

(n, m) = (20,10) 

{n,m) = (20, 20) 

(n,m) = (50,10) 

(n,m) = (50,20) 

0.25 

3 

3 

0.074 

0.071 

0.068 

0.063 

0.50 

3 

3 

0.066 

0.070 

0.061 

0.058 

0.25 

1 

3 

0.056 

0.052 

0.094 

0.048 

0.50 

1 

3 

0.059 

0.062 

0.060 

0.060 

0.75 

1 

3 

0.078 

0.085 

0.065 

0.067 

0.25 

1 

9 

0.008 

0.058 

0.000 

0.067 

0.50 

1 

9 

0.088 

0.052 

0.154 

0.050 

0.75 

1 

9 

0.061 

0.064 

0.069 

0.054 


comparing medians. But for m = 10, it is evident that control over the probability of a Type 
can be unsatisfactory, particularly when (r, s) = (1,9). The fact that method Q does not 
perform well for this particular distribution is not surprising because the .47 quantile is zero. 
When comparing the quartiles with m = 20, method Q can be unsatisfactory with n = 20, 
the highest estimate of actual level being .085. For this particular situation, increasing the 
sample size of two of the groups to 40, the estimate is .064. With all sample sizes equal to 
30 and m = 10, the estimate is .065. 

Precise details regarding the rate of convergence to the nominal level is impossible with 
only 4000 replications, but it is evident that the rate of convergence can depend on the 
nature of the distribution. Among the discrete distributions considered here for which the 
cardinality of the sample space is 21, n > 20 suffices in terms of achieving an estimated Type 
I error probability reasonably close to a nominal .05 level when comparing the medians. But 
for m = 10 (the cardinality of the sample space is 11) and (r, s) = (1, 9), n > 180 is required 
when comparing medians. With n = 100, for example, the estimate of the actual level 
exceeds .09, in which case the .95 conhdence interval for the actual level does not contain 
.075. 

A few simulations were performed using a discrete distribution where the null hypothesis 
is true but not all of the distributions are identical. All indications are that when comparing 
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medians, again the Type I error probability is controlled reasonably well when n = m = 20. 
Consider, for example, a beta-binomial distribution where r = s = 3. Then 11 is the 
.54 quantile. Now, suppose that for all four groups data are generated from a discrete 
distribution such that P{X < x) corresponds to a beta-binomial distribution where r = s = 3 
provided that x < 11, but that otherwise some of the cumulative distributions differ. So the 
distributions are not all identical in the right tail, but the hypothesis of equal population 
medians is true. Consider in particular the situation where for three of the groups the 
probability function, say /(x), corresponds to a beta-binomial probability function when 
X < 15. Otherwise 

f(x) = f; p(x)n 

x=15 

when X > 15, where again P(x) indicates a a beta-binomial distribution given by (|^. Now 
the estimated Type I error probability when comparing the population medians is .057. If 
instead for x = 15,16,... ,21 /(x) is taken to be P(21), P(20),... ,P(15), respectively, the 
estimated Type I error probability is .048. However, when comparing the lower quartiles, 
now control over the Type I error probability exceeds .09. Increasing the sample sizes to 
40, this problem persisted. With all sample sizes equal to 50, control over the Type I error 
probability is reasonably good, the estimate being .064. 


4 An Illustration 

Method Q is illustrated using data from the Well Elderly 2 study (Jackson, et ah, 2009; 
Clark et ah 1997). Generally, the study dealt with the efficacy of a particular intervention 
strategy aimed at improving the physical and emotional health of older adults. One partic¬ 
ular issue was whether four educational groups differed in terms of a measure of meaningful 
activity prior to intervention. The four groups were high school graduate, some college or 
technical school, 4 years of college and post-graduate school. For convenience, these groups 
are designated Gl, G2, G3 and G4 henceforth. Meaningful activity was measured with the 
sum of 29 Likert items, where the possible responses for each item were 0, 1, 2, 3 and 4. 
Higher scores reflect higher levels of meaningful activities. The sample sizes were 62, 81, 110 
and 125, respectively. 
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Figure 2: Boxplots for the four groups compared in the Well Elderly 2 study. Gl=high school 
graduate, G2=some college or technical school, G3=4 years of college and G4=post-graduate 
school. 

Figure 2 shows boxplots for each of the four groups, which suggests that more pronounced 
differences occur based on the lower quartile compared to upper quartile. Applying method 
Q, the p-values corresponding to the .25, .5 and .75 quantiles were 0, .074 and .294, respec¬ 
tively. So in terms of participants who score relatively high on meaningful activities, no 
signihcant difference is found, but a signihcant result is found for low levels of meaningful 
activity as reflected by the .25 quantiles. 
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5 Concluding Remarks 


All indications are that if the cardinality of the sample space is > 20 and the sample 
size is n > 30, reasonably good control over the Type I error probability will be achieved 
using method Q when the goal is to compare the population medians. When comparing the 
quartiles, now n > 50 might be required. Of course, simulations do not guarantee that this 
will be the case for all practical situations that might be encountered. But the main point is 
that no other method has been found that performs even tolerably well in simulations when 
tied values are likely to occur, except for the special case of J = 2 groups. 

There are many variations of the method used here. For example, in various situations, 
weighted bootstrap methods have been suggested when dealing with robust estimators; see 
for example Salibian-Barrera and Zamar (2002) and the papers they cite. There are several 
alternatives to the Harrell-Davis estimator that use a weighted sum of all the order statistics, 
and there are variations of the projection distance that was used, perhaps there are situations 
where some combination of these methods provide a practical advantage over the method 
used here, but this remains to be determined. The main point is that the method in the 
paper is the only known method that continues to perform reasonably well when there are 
tied values. 

Finally, method Q can be applied with the R function Qanova, which has been added to 
the Forge R package WRS. This function is also stored in the hie Rallfun-v28, which can be 
downloaded from http://dornsife.usc.edu/cf/labs/wilcox/wilcox-faculty-display.cfm, 
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